Integrate Terminal Bench Evaluation #1154

XinyuJiangCMU · 2025-12-19T17:49:25Z

📝 PR Description: Integrate Terminal Bench into Slime

📝 Summary

This PR integrates Terminal Bench (TB) into the Slime framework, enabling end-to-end agent evaluation from within Slime.

Structure: Implemented TB eval server/client and configuration templates under examples/eval/terminal_bench.
Pipeline: Successfully hooked TB into eval_delegate, ensuring metrics are correctly parsed and reported to W&B.
Docs: Provided comprehensive English and Chinese quickstart guides and example configuration files.
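
For readers unfamiliar with the delegate flow, the submit-and-poll pattern behind the server/client split can be sketched roughly as below. This is an illustration only, not the code in tb_client.py: the function name `run_remote_eval`, the `state` / `metrics` / `error` response fields, and the injected `submit` / `get_status` callables are all assumptions made for the example. The transport is passed in so the sketch runs without a real TB server.

```python
import time
from typing import Any, Callable, Mapping


def run_remote_eval(
    submit: Callable[[Mapping[str, Any]], str],
    get_status: Callable[[str], Mapping[str, Any]],
    payload: Mapping[str, Any],
    poll_interval_secs: float = 5.0,
    timeout_secs: float = 86400.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Submit one eval job, poll until it finishes, and return its metrics.

    `submit` and `get_status` stand in for HTTP calls to the eval server;
    their names and the response fields are hypothetical.
    """
    job_id = submit(payload)
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        status = get_status(job_id)
        state = status.get("state")
        if state == "completed":
            return status.get("metrics", {})
        if state == "failed":
            raise RuntimeError(f"eval job {job_id} failed: {status.get('error')}")
        # Job is still running: wait before asking again.
        sleep(poll_interval_secs)
    raise TimeoutError(f"eval job {job_id} did not finish within {timeout_secs}s")


def demo() -> Mapping[str, Any]:
    """Run the loop against an in-memory fake server (no network needed)."""
    responses = iter(
        [
            {"state": "running"},
            {"state": "completed", "metrics": {"accuracy": 0.5}},
        ]
    )
    return run_remote_eval(
        submit=lambda payload: "job-1",
        get_status=lambda job_id: next(responses),
        payload={"model_name": "qwen3-8b"},
        sleep=lambda _secs: None,  # skip real waiting in the demo
    )
```

Calling `demo()` returns the metrics mapping from the second fake response. In a real client the two callables would wrap HTTP requests to the TB server URL given in the delegate config.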

✅ Checklist

Server: tb_server.py implemented (Host-side).
Client: tb_client.py implemented (Container-side).
Networking: Verified Slime-to-Host network connectivity.
Integration: Logic wired into eval_delegate.py.
Documentation: Updated guide to reflect the Container-Host architecture.
Test: End-to-end tests passed.

⏳ To-do

Support Terminal-Bench 2.0
The current integration targets TB v1.0 via the tb run CLI; the workflow will be extended to support TB v2.0 based on harbor run.
Support configurable agents
The TB server currently hard-codes the terminus-2 agent; agent selection will be made configurable to support additional agents.
Add dataset selection support to TB server
The server currently uses the default terminal-bench-core dataset; a -d / --dataset argument will be added to enable evaluation on other registered datasets.
Expand evaluation coverage to more models
End-to-end validation has been performed on Qwen3-8B and Qwen3-32B; evaluations will be extended to additional models.
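
As a rough idea of what the configurable agent and dataset items could look like on the server side, here is a minimal argparse sketch. The `-d` / `--dataset` option and the two defaults (terminus-2, terminal-bench-core) come from the to-do list above; the `--agent` flag name and the function name `build_arg_parser` are assumptions for illustration and may not match what the TB server ends up using.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Hypothetical command-line options for the TB server."""
    parser = argparse.ArgumentParser(description="Terminal Bench eval server (sketch)")
    parser.add_argument(
        "--agent",  # flag name is an assumption
        default="terminus-2",
        help="Agent to evaluate with; defaults to the currently hard-coded agent.",
    )
    parser.add_argument(
        "-d",
        "--dataset",
        default="terminal-bench-core",
        help="Registered dataset to evaluate on; defaults to the current default.",
    )
    return parser


# With no arguments the current behavior is preserved.
default_args = build_arg_parser().parse_args([])

# Both values can be overridden; "my-dataset" and "my-agent" are placeholders.
custom_args = build_arg_parser().parse_args(["-d", "my-dataset", "--agent", "my-agent"])
```

Keeping the present values as defaults means existing launch commands would keep working unchanged.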

🤝 Collaborators

guapisolo · 2025-12-30T21:15:17Z

examples/eval/terminal_bench/README.md

Add a model download command and ckpt conversion here.

guapisolo · 2025-12-30T22:41:59Z

examples/eval/scripts/eval_tb_smoke.yaml

+  delegate:
+    - name: terminal_bench
+      # type: examples.eval.terminal_bench.tb_config.build_terminal_bench_config
+      url: http://172.17.0.1:9052


Add a comment noting that this port should match the TB server's port on the host machine.

guapisolo · 2025-12-30T22:42:21Z

examples/eval/scripts/eval_tb_smoke.yaml

+      timeout_secs: 86400  # 24 hours
+      max_retries: 1 # HTTP request retries from Slime to the TB server
+      model_name: qwen3-8b
+      api_base: http://127.0.1.1:30005/v1


Add a comment noting that this port should match the sglang router port.

guapisolo · 2025-12-30T22:42:44Z

examples/eval/scripts/eval_tb_smoke.yaml

+      max_retries: 1 # HTTP request retries from Slime to the TB server
+      model_name: qwen3-8b
+      api_base: http://127.0.1.1:30005/v1
+      dataset_path: /mnt/data/xinyu/program/slime-tb/terminal-bench/tasks


Comment: This is the dataset path on the host machine.

Added this in the quick-start README.

guapisolo · 2025-12-30T22:43:22Z

examples/eval/scripts/run-eval-tb-qwen.sh

+ray start --head --node-ip-address ${MASTER_ADDR} --port 6380 --num-gpus 2 \
+            --disable-usage-stats \
+            --dashboard-host=0.0.0.0 \
+            --dashboard-port=8266 \


Add a comment here about port conflicts.

Added this in the quick-start README.

guapisolo · 2025-12-30T22:43:42Z

examples/eval/terminal_bench/config/local_cluster.yaml

@@ -0,0 +1,12 @@
+# Minimal Terminal Bench delegate config for running on the host (no containers).


Do we need to keep this?

Not used anywhere, removed.

guapisolo · 2025-12-30T22:44:28Z

examples/eval/terminal_bench/README.md

+  --ulimit stack=67108864 \
+  --ulimit nofile=65536:65536 \
+  -v ~/.cache:/root/.cache \
+  -v $(pwd)/slime:/opt/slime \


Mounting the /opt folder in the slime Docker container causes an error. Change it to another path, such as /shared.

Switched the mount to /shared to avoid /opt issues. Thanks for pointing this out.

guapisolo · 2025-12-30T22:45:39Z

examples/eval/terminal_bench/tb_config.py

+
+
+    @classmethod
+    def parse(cls, args, raw_env_config: Mapping[str, Any], defaults: Mapping[str, Any]) -> TerminalBenchConfig:


Is there a better way to implement this?

Thanks for the suggestion. I refactored the implementation to reduce repetition by looping over a field-to-cast mapping. Please let me know if this looks reasonable.
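
To make the idea concrete, a field-to-cast mapping with a loop might look roughly like the sketch below. It is not the code in tb_config.py: the class name `ExampleTBConfig`, the `_FIELD_CASTS` table, and the defaults are assumptions, and the real `parse` also takes an `args` parameter that is left out here. The field names are taken from the smoke config shown earlier in this review.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping

# Hypothetical table: config field name -> function that casts the raw value.
_FIELD_CASTS: Dict[str, Callable[[Any], Any]] = {
    "model_name": str,
    "api_base": str,
    "dataset_path": str,
    "timeout_secs": int,
    "max_retries": int,
}


@dataclass
class ExampleTBConfig:
    """Illustrative stand-in for TerminalBenchConfig."""

    model_name: str = "qwen3-8b"
    api_base: str = ""
    dataset_path: str = ""
    timeout_secs: int = 86400
    max_retries: int = 1

    @classmethod
    def parse(
        cls, raw_env_config: Mapping[str, Any], defaults: Mapping[str, Any]
    ) -> "ExampleTBConfig":
        # Per-environment values override the shared defaults.
        merged = {**defaults, **raw_env_config}
        kwargs: Dict[str, Any] = {}
        # One loop replaces a separate "if key in config: cast" block per field.
        for name, cast in _FIELD_CASTS.items():
            value = merged.get(name)
            if value is not None:
                kwargs[name] = cast(value)
        return cls(**kwargs)


# YAML values may arrive as strings; the mapping casts them to the right type.
example_cfg = ExampleTBConfig.parse({"timeout_secs": "3600"}, {"max_retries": 2})
```

Adding a new option then only requires a dataclass field and one entry in the mapping.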

guapisolo · 2026-01-03T03:39:48Z

LGTM. Good job.

guapisolo · 2026-01-03T03:45:49Z

@zhuzilin Hi Zilin, I think this PR generally looks good and is minimally invasive, and we've tested its functionality on different machines. Do you have any other suggestions?

- Integrates **Terminal Bench** as an eval delegate for **Slime**, enabling evaluation via an external TB server.
- Adds a minimal **smoke eval config** and an example **Qwen3-8B** launch script for quick end-to-end testing.
- Provides client/server support for submitting eval jobs, polling status, and collecting metrics from Terminal Bench.

Co-authored-by: Zhiyao Jiang <jessicajiang324@gmail.com>
Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>

JessicaJiang-123 · 2026-01-14T02:33:56Z

⏳ To-do

Support Terminal-Bench 2.0
The current integration targets TB v1.0 via the tb run CLI; the workflow will be extended to support TB v2.0 based on harbor run.
Support configurable agents
The TB server currently hard-codes the terminus-2 agent; agent selection will be made configurable to support additional agents.
Add dataset selection support to TB server
The server currently uses the default terminal-bench-core dataset; a -d / --dataset argument will be added to enable evaluation on other registered datasets.
Expand evaluation coverage to more models
End-to-end validation has been performed on Qwen3-8B and Qwen3-32B; evaluations will be extended to additional models.

zhuzilin · 2026-01-16T13:29:22Z

I'm not sure this is a suitable PR for slime, because it seems to be mainly an introduction to using Terminal Bench for evaluation and does not seem to show any special capability of slime.

The goal of slime is not to support the evaluation of all mainstream benchmarks or to recommend a particular evaluation pipeline.

I'll close this for the same reason as #1025.

XinyuJiangCMU changed the title ~~Add TerminalBench eval delegate + quickstart~~ [WIP] Add TerminalBench eval delegate + quickstart Dec 19, 2025

guapisolo reviewed Dec 30, 2025

XinyuJiangCMU changed the title ~~[WIP] Add TerminalBench eval delegate + quickstart~~ Integrate Terminal Bench Evaluation Jan 1, 2026

XinyuJiangCMU marked this pull request as ready for review January 3, 2026 03:07

Xinyu Jiang and others added 6 commits January 14, 2026 01:17

Add TerminalBench eval scaffold

b463c56

successfully integrate tb in slime delegate eval with train

5d15d58

Co-authored-by: Zhiyao Jiang <jessicajiang324@gmail.com> Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>

write quick-start for slime + tb delegate eval

a6ee81b

Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu> Co-authored-by: Zhiyao Jiang <jessicajiang324@gmail.com>

modify code and quick-start based on review comments

3cad96e

Co-authored-by: Zhiyao Jiang <jessicajiang324@gmail.com> Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>

add README-cn.md

98facc6

Co-authored-by: Zhiyao Jiang <jessicajiang324@gmail.com> Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>

JessicaJiang-123 force-pushed the feat/tb-eval-integration branch from 1fd519d to 98facc6 Compare January 14, 2026 01:26

zhuzilin closed this Jan 16, 2026


4 participants
